Table of Contents

Data Description

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

Task: estimate the sale price from the given features.

Model Introduction

The term Boosting refers to a family of algorithms that convert weak learners into strong learners.

There are many boosting algorithms that can give an additional boost to a model’s accuracy. In this tutorial, we’ll learn about the two most commonly used ones: Gradient Boosting (GBM) and XGBoost.

Generally, XGBoost is considered more advanced than GBM.

Notes

Imports

Important Scripts

Load the data

Feature selection

Log transform large values

Train-Test split after log transform

Train-Validation Split

Scaling the Data

Modelling xgboost

https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters

Parameters:
-------------
max_depth=3
learning_rate=0.1
n_estimators=100 # number of trees you want to build.
verbosity=1 # NOTE: output is printed in the IPython/terminal console, not in the browser notebook
silent=None # deprecated; use verbosity instead
objective='binary:logistic' # default for binary classification; a regression task uses an objective such as 'reg:squarederror'
booster='gbtree' # keep the default tree booster rather than the linear one, even for regression (dart may also be used instead of gbtree, but it needs extra tuning)
n_jobs=1 # set this to -1 to use all cores
nthread=None # deprecated; use n_jobs instead
gamma=0 # minimum loss reduction required to make a split; a higher value leads to fewer splits
min_child_weight=1
max_delta_step=0
subsample=1 # fraction of samples used per tree; a low value can lead to underfitting
colsample_bytree=1 # fraction of features used per tree; a high value can lead to overfitting
colsample_bylevel=1
colsample_bynode=1
reg_alpha=0 # L1 regularization on leaf weights; a larger value means more regularization
reg_lambda=1 # L2 regularization on leaf weights; smoother than L1 regularization
scale_pos_weight=1
base_score=0.5
random_state=0 # use your own random state
seed=None # deprecated; use random_state instead
missing=None
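
As a rough sketch, the sklearn-style wrapper could be instantiated with these parameters for the regression task above; the regression objective and the X_train/y_train names are assumptions, not taken from the original notebook.

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    max_depth=3,
    learning_rate=0.1,
    n_estimators=100,
    verbosity=1,
    objective='reg:squarederror',  # regression objective for predicting price (assumption)
    booster='gbtree',
    n_jobs=-1,                     # use all cores
    gamma=0,
    min_child_weight=1,
    subsample=1,
    colsample_bytree=1,
    reg_alpha=0,
    reg_lambda=1,
    random_state=0,
)
model.fit(X_train, y_train)        # X_train, y_train assumed from the split above
```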

Early stopping:

If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in evals. If there’s more than one, it will use the last.

train(..., evals=evals, early_stopping_rounds=10)

The model will train until the validation score stops improving. Validation error needs to decrease at least every early_stopping_rounds to continue training.

If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit. Note that xgboost.train() will return a model from the last iteration, not the best one.

This works with both metrics to minimize (RMSE, log loss, etc.) and to maximize (MAP, NDCG, AUC). Note that if you specify more than one evaluation metric the last one in param['eval_metric'] is used for early stopping.

If early stopping is enabled during training, you can get predictions from the best iteration with bst.best_ntree_limit:

ypred = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)
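
A minimal sketch of early stopping with the native train() API, assuming dtrain and dvalid DMatrix objects built from the train/validation split (names and parameter values are illustrative):

```python
import xgboost as xgb

# dtrain / dvalid are assumed DMatrix objects built from the train/validation split
params = {'objective': 'reg:squarederror', 'eval_metric': 'rmse', 'max_depth': 3, 'eta': 0.1}
evals = [(dtrain, 'train'), (dvalid, 'valid')]   # the last entry is used for early stopping

bst = xgb.train(params, dtrain, num_boost_round=1000,
                evals=evals, early_stopping_rounds=10)

print(bst.best_score, bst.best_iteration, bst.best_ntree_limit)

# predict with the best number of trees, not the last iteration
ypred = bst.predict(dvalid, ntree_limit=bst.best_ntree_limit)
```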

Faster implementations

Flexible tree growing policies

The existing tree grower in xgboost grows a tree depth-wise, executing splits at the first level before splits at the second, and so forth. The new grower lets you control the way new nodes are added to the tree:

grow_policy=depthwise (default): split at the nodes closest to the root, i.e. grow depth-wise.
grow_policy=lossguide: split at the nodes with the highest loss change. This behavior mimics that of LightGBM. It has been reported that the lossguide policy often results in faster convergence in loss, though there is also a risk of over-fitting (see the preliminary results).
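
A hedged sketch of switching to the faster histogram-based grower with a loss-guided policy; the parameter values here are illustrative, not tuned:

```python
import xgboost as xgb

params = {
    'objective': 'reg:squarederror',
    'tree_method': 'hist',        # histogram-based grower, needed for lossguide
    'grow_policy': 'lossguide',   # split at the nodes with the highest loss change
    'max_leaves': 63,             # with lossguide, cap the number of leaves instead of the depth
    'eta': 0.1,
}
bst = xgb.train(params, dtrain, num_boost_round=200)   # dtrain assumed from the DMatrix step
```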

Feature Importance

NOTE: `model.feature_importances_` gives feature importances based on importance type `gain`, while `xgb.plot_importance(model)` ranks features by `weight`, i.e. the number of times a feature is used to split.
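
For illustration, assuming a fitted sklearn-style model and a feature_names list from the earlier steps, both views of importance might be inspected like this:

```python
import xgboost as xgb
import matplotlib.pyplot as plt

# sklearn-wrapper attribute: one importance score per feature (gain-based)
print(dict(zip(feature_names, model.feature_importances_)))   # feature_names is assumed

# plot_importance ranks features by weight (number of times used in splits) by default
xgb.plot_importance(model, max_num_features=15)
plt.show()
```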

Xgboost using DMatrix

Using DMatrix saves RAM; when working with xgboost, it is better to use its native DMatrix data type than pandas DataFrames or numpy arrays.
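
A minimal sketch, assuming X_train/y_train and X_valid/y_valid come from the splits above:

```python
import xgboost as xgb

# build DMatrix objects once, then free the original DataFrames to save RAM
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)
del X_train, X_valid   # assumed to be no longer needed after conversion
```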

Cross Validation
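
As a sketch of what this step might look like with the native API (the parameter values are illustrative and dtrain is the DMatrix built above):

```python
import xgboost as xgb

# k-fold cross validation with the native API
cv_results = xgb.cv(
    {'objective': 'reg:squarederror', 'eta': 0.1, 'max_depth': 3},
    dtrain,
    num_boost_round=500,
    nfold=5,
    metrics='rmse',
    early_stopping_rounds=10,
    seed=0,
)
print(cv_results['test-rmse-mean'].min())
```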

Modelling xgboost using dtrain

XGBoost has its own data type (DMatrix) which is much more memory efficient. Use it, and delete the pandas DataFrame if it is no longer required, to save memory.

When using xgb.DMatrix, always pass the feature_names parameter so that feature-importance output shows the actual feature names rather than placeholders like f0, f1, f2, etc.
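
For example (a sketch; X_train/y_train are assumed from the earlier split):

```python
import xgboost as xgb

# passing feature_names keeps the real column names in the importance output
dtrain = xgb.DMatrix(X_train.values, label=y_train,
                     feature_names=list(X_train.columns))

bst = xgb.train({'objective': 'reg:squarederror', 'eta': 0.1},
                dtrain, num_boost_round=100)
print(bst.get_score(importance_type='gain'))   # keys are column names, not f0, f1, ...
```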

Model Evaluation Using shap
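
A possible sketch using the shap library, assuming the fitted xgboost model and X_valid from the validation split:

```python
import shap

# explain the trained model's predictions on the validation set
explainer = shap.TreeExplainer(model)          # model is the fitted xgboost model (assumption)
shap_values = explainer.shap_values(X_valid)   # X_valid assumed from the validation split

shap.summary_plot(shap_values, X_valid)        # overall impact of each feature on predicted price
```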

HPO (Hyper Parameter Optimization)

We should generally optimize model complexity first and then tune convergence (see the sketch below).

model complexity: n_estimators, max_depth, etc.
convergence: learning_rate
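
As a hedged sketch of this two-step tuning with scikit-learn's GridSearchCV (the grids, scoring choice, and data names are assumptions):

```python
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# step 1: tune model complexity (grid values are illustrative)
complexity_grid = {'max_depth': [3, 5, 7], 'n_estimators': [100, 300, 500]}
complexity_search = GridSearchCV(
    xgb.XGBRegressor(learning_rate=0.1, n_jobs=-1, random_state=0),
    complexity_grid,
    scoring='neg_mean_squared_error',
    cv=3,
)
complexity_search.fit(X_train, y_train)   # X_train, y_train assumed from the split above

# step 2: with complexity fixed, tune convergence (the learning rate)
lr_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.3]}
lr_search = GridSearchCV(
    xgb.XGBRegressor(**complexity_search.best_params_, n_jobs=-1, random_state=0),
    lr_grid,
    scoring='neg_mean_squared_error',
    cv=3,
)
lr_search.fit(X_train, y_train)

print(complexity_search.best_params_, lr_search.best_params_)
```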

Parameters:

Regularization parameters:

HPO Model Complexity

Best Model

Time Taken